Method for automatic extraction of data from graph

ABSTRACT

A method for automatic extraction of data from a graph, including text area locating and text box classification; locating of coordinate axes, and locating of the positions of hatch marks on the coordinate axes; legend locating and information extraction; extracting corresponding bar or polyline connected components according to legend color, and filtering and classification; determining key points on the X-axis and locating a corresponding X-axis label for each key point; locating key points of the bars and polyline according to the X-axis key points, determining labeled numerical text boxes that correspond to the key points, and identifying the numerical text; calculating a corresponding value for each pixel, and estimating corresponding values of the key points of the bars or polyline; determining a final result according to a difference between the estimated values and the recognized labeled values.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 201910972334.8, filed Oct. 14, 2019.

TECHNICAL FIELD

The present disclosure relates to the fields of computer image processing and pattern recognition, and in particular to a method for automatic extraction of data from a graph.

BACKGROUND

Bar charts, line charts and the like are a more intuitive way of presenting data, and are widely used in various industries, especially in finance, scientific research, statistics, etc. However, in daily work, when one wants to cite, in a report or article, data contained within a bar chart or line chart made by others, it is difficult to obtain that data because the original data is not available. A simple way is to obtain data intuitively by observation, estimation, measurement, etc. But this intuitive estimation is imprecise and inefficient. An automated data extraction method that improves the efficiency and precision of data acquisition would therefore be of great benefit.

Currently, there is a semi-automatic method for obtaining data from bar charts and line charts. This method mainly includes manually labeling minimum and maximum positions on the coordinate axes and corresponding values, height positions of the bars and key points on the polyline to obtain the values at specific positions on the X-axis. Representative software adopting this method includes GetData and Engauge Digitizer.

Bar charts and line charts are a representation of data according to a certain rule, in the form of images. In essence, an image is a collection of pixels, each pixel having an RGB value and arranged at a different position. Image analysis mainly includes using an image processing method to identify information such as the bounding rectangle of the bar, color, and the position of the polyline, in order to further extract data.

In the present disclosure, bar charts, line charts and mixed graphs of bar charts and line charts are collectively referred to as “data graphs,” i.e., graphs generated from data. A data graph may involve various standard elements, including: X-axis, Y-axis, hatch mark values, hatch mark lines, hatch marks, X-axis labels, X-axis label intervals, coordinate axes, legends, etc. In order to facilitate understanding of the various elements of a data graph, FIG. 1 shows an exemplary labeled graph. In addition, the bar chart has a bar foreground and the line chart has a line foreground, and key points on the bars or lines are provided with labeled values. These key elements may be omitted from the data graphs in some cases, and the layout may change. Generally speaking, there may be two Y-axes, on the left and right respectively; the bars may be in a horizontal direction; the legends may be in different places; Y-axis hatch mark values may be missing. These elements of the data graphs are interdependent, i.e., there are positional relationships between one another. Generally speaking, hatch mark values of a left Y-axis must exist to the left of the Y-axis; hatch mark values of a right Y-axis must exist to the right of the Y-axis; hatch mark values of an X-axis must exist below the X-axis; the legend generally consists of solid blocks, lines or dots, different parts of the same legend are of the same color, and there are text characters to the right of the legend.

Technologies and methods relating to image processing and pattern recognition are needed to locate and identify these key elements. However, automatic extraction of data from a graph is a technical problem that needs to be solved urgently.

SUMMARY OF PARTICULAR EMBODIMENTS

An object of the present disclosure is to provide a method for automatic extraction of data from a graph, in view of the low efficiency problem in the prior art using semi-automatic data extraction.

In some embodiments, the method of the present disclosure uses a deep learning method to locate text boxes in a data graph and perform character recognition, then extracts all kinds of other elements according to a certain order and rules, verifies whether the locating and identification of the elements are correct using a positional correlation between the elements, and finally calculates the height of key points on the bars and the line chart, obtains the estimated value of the key points of the bars or the polylines by an obtained corresponding value of each pixel on the coordinate axis, and compares it with an identified value, to determine an optimal result.

The present disclosure provides at least the following technical solutions:

A method for automatic extraction of data from a graph, for extracting element data from a data graph with bars or polylines, comprising the steps of:

S1: text area locating and text box classification in data graph according to steps S11 to S15:

S11: obtaining a data graph where data is to be extracted, locating all text boxes within the data graph by deep learning, and performing character recognition;

S12: counting the number of text boxes at each position in X direction of the data graph, to obtain an array of numbers of text boxes at different positions in X direction; obtaining a local maximum of the numbers of text boxes in the array and a corresponding position; obtaining the difference between an average number of text boxes in a middle area in X direction and a local maximum of the array, and determining there is a Y-axis hatch mark value text box at the corresponding position of the local maximum if the difference is within a threshold; and determining all text boxes at the corresponding position of the local maximum as the Y-axis hatch mark value text boxes according to the position, to obtain a Y-axis hatch mark value text box list;

S13: performing a text box spacing consistency test on the Y-axis hatch mark value text box list using a noise data filtering method, with the spacing between adjacent text boxes being a filtering condition;

S14: obtaining an X-axis hatch mark value text box list according to a method similar to S12 and S13;

S15: identifying graph title text in a graph title text box according to size characteristics of the graph title text box and positional distribution characteristics in the data graph;

S2: locating of coordinate axes, and locating of the positions of hatch marks on the coordinate axes according to steps S21 to S22:

S21: locating coordinate axes from the data graph, which comprises:

first calculating a horizontal gradient and a vertical gradient of the data graph, respectively, and determining vertical edge pixels and horizontal edge pixels according to a horizontal gradient result and a vertical gradient result, respectively;

then counting the number of consecutive edge pixels in each column and the number of consecutive edge pixels in each row, determining an edge pixel column whose number of consecutive edge pixels exceeds a set threshold as a candidate Y-axis, and determining an edge pixel row whose number of consecutive edge pixels exceeds a set threshold as a candidate X-axis;

then merging adjacent candidate coordinate axes whose distance is less than a distance threshold;

finally determining the coordinate axis and axis hatch mark value text box lists according to a positional relationship between candidate coordinate axes and candidate axis hatch mark value text box lists;

S22: locating the positions of hatch marks on X-axis and Y-axis sequentially, where the hatch marks on each of the coordinate axes are located by:

first extracting a coordinate axis area image centered at a coordinate axis, where the width of the area image in a direction perpendicular to the coordinate axis covers the entire coordinate axis and hatch marks on the coordinate axis;

then binarizing the coordinate axis area image, where the coordinate axis and the hatch marks on the coordinate axis are foreground;

then counting foreground pixels in the binarized image in a direction perpendicular to the coordinate axis in a row-by-row or column-by-column manner;

then obtaining a local maximum of an array obtained from the counting, as the position of a candidate hatch mark;

finally filtering the obtained candidate hatch mark by a noise data filtering method, to obtain an actual hatch mark on the coordinate axis;

S3: legend locating and information extraction according to steps S31 to S36:

S31: performing connected component analysis by calculating color distances between adjacent pixels, to find all connected components with similar colors in the data graph; obtaining an average color value for each connected component, as the color of the connected component; and counting the number of pixels in the connected component and bounding rectangles;

S32: filtering all the connected components according to height, width, number of pixels, aspect ratio and compactness of the connected components using a threshold method, to obtain a candidate legend meeting a legend requirement;

S33: scanning all possible candidate legend connected components in pairs, so that two connected components meeting color and height consistency requirements are combined into a new candidate legend;

S34: performing S31 to S33 on each of the three areas of the data graph: above, to the right of, and below a data area, to obtain all candidate legends in these three areas; selecting candidate legends in an area with the largest number of candidate legends as actual legends of the data graph according to the number of candidate legends in each of the three areas;

S35: performing layout analysis on the obtained actual legends according to spatial positions of the legends, to determine whether the legends in the data graph are arranged in a vertical, horizontal or hybrid layout; and filtering out legends nonconforming with the layout;

S36: according to the layout of the legends, searching for a corresponding legend text box for each legend from the data graph, and identifying text characters and character color from each legend text box;

S4: extracting corresponding bar or polyline connected components according to legend color, and filtering and classification according to steps S41 to S45:

S41: combining background color, character color in the text and legend color into a color list of a variety of color classes; scanning the pixels in the data area of the data graph, calculating color distances between the color of a pixel and the variety of colors in the color list, and determining a color class having the smallest color distance as the class of the pixel;

S42: performing connected component analysis on pixels of each class, filtering the connected components by a threshold method, to obtain a corresponding set of connected components for each legend in the data area;

S43: based on the height, width, number of pixels and compactness of the connected components, scanning all connected component sets according to a threshold, to determine for each connected component whether the connected component is a bar, and if it is a bar, calculating the variance of the heights of all bars and the variance of the widths of all the bars in the graph, determining whether the bars in the bar chart are horizontal or vertical according to the variances, and calculating the width of the bars; if there is no bar, determining that the data graph is a line chart, with a vertical layout;

S44: according to the layout direction type of the data graph, identifying for each legend whether a connected component set corresponding to the legend is a bar or a polyline, and determining a classification axis and a numerical axis in the data graph;

S45: for all the connected components corresponding to legends identified as bars, selecting a bar whose width meets the bar width described in S43 as a candidate bar for the legend, and analyzing the spatial positions and distances of all the bars to identify whether there is a bar divided by a polyline into two connected components, and if there is, recombining them into one;

S5: determining key points on the classification axis and locating a corresponding classification-axis label for each classification-axis key point according to the layout direction type of the data graph;

S6: locating key data points of the bars or polyline according to the classification-axis key points, determining a corresponding labeled numerical text box for each key data point, and identifying the numerical text;

S7: calculating a corresponding value for each pixel according to the numerical axis, and estimating corresponding values for the key points of the bars or polyline;

S8: for each key data point in the data graph, performing error verification on the identified numerical value by the estimated value, to determine a final result.

Preferably, the noise data filtering method comprises:

comparing all the data to be filtered in pairs, to find a data pair with the smallest value difference corresponding to the filter condition; if the value difference meets an error requirement, calculating the average of the data pair and determining the average of the data pair as a standard value; then calculating a difference between each of the rest of the data to be filtered and the standard value, and filtering out data with a difference exceeding a threshold.

Preferably, in step S12, if there are Y-axis hatch mark value text boxes on both sides of the data graph, it is determined that there are two Y-axes, left and right; and a left Y-axis hatch mark value text box list and a right Y-axis hatch mark value text box list are obtained.

Preferably, in step S33, the new legend has a bounding rectangle composed of the bounding rectangles of the two connected components, the number of pixels is the sum of the pixels of the two connected components, and the color is the average of the colors of the two connected components.

Preferably, in step S41, if there is no legend in the data graph, pixels in the data area with a non-background and non-character color are of a foreground class.

Preferably, in step S44, the determining a classification axis and a numerical axis in the data graph comprises:

when the data graph is in a vertical layout, determining the X-axis as the classification axis and the Y-axis as the numerical axis; when the data graph is in a horizontal layout, determining the Y-axis as the classification axis and the X-axis as the numerical axis.

Preferably, step S5 comprises:

S51: if there are hatch marks on the classification axis, sorting them based on their positions according to the obtained hatch marks on the classification axis; and determining a middle point of two adjacent hatch marks as a classification axis key point;

S52: if there is no hatch mark on the classification axis, determining a middle point between classification axis hatch mark value text boxes as a classification axis key point; and filtering the obtained classification axis key points by a noise data filtering method.

Preferably, step S6 comprises:

S61: determining key data points of the bars or polyline respectively, where the key data point of a vertical bar is a middle point of the top edge of the bar, and the key data point of a horizontal bar is a middle point of the far right of the bar, and the key data points of a polyline are the data points on the polyline vertically corresponding to the key points of the classification axis;

S62: according to the position of each key data point, the layout of the data graph and the positions of the text boxes in the graph, searching for a corresponding labeled numerical text box for each key data point;

S63: identifying a labeled numerical value in each labeled numerical text box.

Preferably, step S7 comprises:

S71: matching according to the positional relationships between the hatch marks on the numerical axis and the labeled numerical text boxes on the numerical axis, and identifying the values in the labeled numerical text boxes on the numerical axis;

S72: for any two adjacent hatch marks on the numerical axis, calculating a corresponding value for each pixel according to the number of pixels between the two hatch marks and the difference between the values in the corresponding labeled numerical text boxes, where the calculated values form a single pixel corresponding value list;

S73: filtering out noise from the single pixel corresponding value list by a noise data filtering method;

S74: calculating an average value of the single pixel corresponding value list after the noise filtering, as a final corresponding value M of the single pixel;

S75: according to the obtained corresponding value M of the single pixel and the bar height H of the key data points, calculating an estimated value for each key data point, where the bar height H of a data graph in a vertical layout is the distance from the key data point to the X-axis, and the bar height H of a data graph in a horizontal layout is the distance from the key data point to the left Y-axis.
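By way of a non-limiting illustration, steps S72, S74 and S75 may be sketched as follows. The function names are hypothetical, the noise filtering of S73 is omitted for brevity, and the estimate H × M is an assumption that holds only when the axis from which H is measured corresponds to the value zero.

```python
from typing import List


def per_pixel_values(tick_pixels: List[float], tick_values: List[float]) -> List[float]:
    """S72: one value-per-pixel ratio for every pair of adjacent hatch marks."""
    ratios = []
    for i in range(len(tick_pixels) - 1):
        pixel_gap = abs(tick_pixels[i + 1] - tick_pixels[i])
        value_gap = abs(tick_values[i + 1] - tick_values[i])
        if pixel_gap > 0:  # skip degenerate pairs at the same pixel position
            ratios.append(value_gap / pixel_gap)
    return ratios


def mean_value_per_pixel(ratios: List[float]) -> float:
    """S74: average of the ratio list, used as the final per-pixel value M."""
    return sum(ratios) / len(ratios)


def estimate_key_point_value(height_px: float, m: float) -> float:
    """S75 (assumed form): estimated value = bar height H in pixels times M."""
    return height_px * m


# Hatch marks at pixel rows 400, 300, 200, 100 labeled 0, 50, 100, 150.
ratios = per_pixel_values([400, 300, 200, 100], [0, 50, 100, 150])
m = mean_value_per_pixel(ratios)
estimate = estimate_key_point_value(240, m)
```

In practice the ratio list would first be passed through the noise data filtering method so that a misrecognized hatch mark value does not distort M.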

Preferably, step S8 comprises:

for each key data point in the data graph, comparing the labeled value obtained by step S63 with the estimated value obtained by step S75, and if within an error range, determining the recognition result correct and determining the labeled value as the value of the key point; otherwise, determining the estimated value as the value of the key point.
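The verification of step S8 may, for example, be sketched as below. The function name, the use of a relative error, the 5% tolerance and the handling of a missing label are all assumptions made for illustration; the disclosure only requires that the two values lie within an error range.

```python
from typing import Optional


def resolve_key_point_value(labeled: Optional[float],
                            estimated: float,
                            relative_tolerance: float = 0.05) -> float:
    """Return the labeled value if it agrees with the estimate, else the estimate."""
    if labeled is not None:
        # Guard against division by zero when the estimate is zero.
        scale = max(abs(estimated), 1e-9)
        if abs(labeled - estimated) / scale <= relative_tolerance:
            return labeled  # recognition is judged correct
    return estimated  # label missing or inconsistent with the geometry


value_ok = resolve_key_point_value(120.5, 120.0)    # label agrees with estimate
value_bad = resolve_key_point_value(1205.0, 120.0)  # e.g. a lost decimal point
```

The labeled value is preferred when it passes the check because recognized text carries the exact figure, whereas the pixel-based estimate is limited by image resolution.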

Compared with the prior art, the present disclosure has the following beneficial effects:

The present disclosure can automatically extract data contained within data graphs (bar charts and line charts). The method of the present disclosure is applicable to most bar charts, line charts and mixed data graphs of the two, and has a high recognition accuracy and speed. In addition, the method can directly store the data within the chart in an Excel format, which includes most of the information, including the legend, X-axis labels and data series.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates exemplary elements of a data graph;

FIG. 2 is a flow chart of a method according to the present disclosure;

FIG. 3 shows an inputted original data graph;

FIG. 4 shows a result from text locating;

FIG. 5 shows a result from vertical projection of text boxes;

FIG. 6 shows a result from classification of the text boxes;

FIG. 7 shows a candidate Y-axis;

FIG. 8 shows the process of locating Y-axis hatch marks; specifically, a) is the extracted coordinate axis, b) is the result from binarization, c) is the result from horizontal projection of foreground pixels, and d) is the result from the locating (circles);

FIG. 9 shows a result from legend locating (black rectangular frame);

FIG. 10 shows a located legend text box (color of the box represents the legend it belongs to);

FIG. 11 shows a result from extracting foreground pixels according to the legend color (foreground color is the corresponding legend color);

FIG. 12 shows locating key points on the X axis (indicated by small circles);

FIG. 13 shows the located bars and key points of the polyline (indicated by small circles) and corresponding labeled text boxes (color of the rectangular frame corresponds to the legend color);

FIG. 14 shows exemplary input data graph and corresponding output result.

DETAILED DESCRIPTION OF PARTICULAR EMBODIMENTS

The present disclosure will be further described below with reference to the drawings and specific embodiments. The technical features of the various embodiments of the present disclosure can be combined according to actual needs if there is no conflict.

In view of the problem of extracting data contained within a data graph (bar graph or line graph), the present disclosure provides a method for automatic extraction of data from a graph based on image processing and recognition. This method can effectively extract data from most conventional data graphs.

The specific technical solutions used in the present disclosure are described as follows.

A method for automatic extraction of data from a graph includes the following steps:

(1) text area locating and text box classification in a data graph;

(2) locating of coordinate axes, and locating of the positions of hatch marks on the coordinate axes;

(3) legend locating and information extraction;

(4) extracting corresponding bar or polyline connected components according to legend color, and filtering and classification;

(5) determining key points on the X-axis and locating a corresponding X-axis label for each key point;

(6) locating key points of the bars and polyline according to the X-axis key points, determining labeled numerical text boxes that correspond to the key points, and identifying the numerical text;

(7) calculating a corresponding value for each pixel, and estimating corresponding values of the key points of the bars or polyline;

(8) determining a final result according to a difference between the estimated values and the recognized labeled values, and storing.

The above 8 steps may form a continuous process of obtaining various element information from a data graph, and the steps may be dependent on one another. For example, text box locating is a prerequisite for text box classification, and also a basis for subsequent data identification. The steps are described in detail below.

1. Text Area Locating and Text Box Classification in Data Graph

Text area locating and text box classification refer to determining the rectangular box in which each piece of text lies using an existing text locating method, and classifying the text box according to a positional relationship of the text box and relationships between the text box and other objects. In the present disclosure, the text box may be any of the 6 types: Y-axis hatch mark value text box (left or right), X-axis label text box, graph title, labeled numerical text box, and other text boxes. In the present disclosure, text box layout characteristics mainly include: there are more Y-axis text boxes in a certain position area than in the middle position area in the X direction of the image; there are more X-axis label text boxes in a certain position area than in other position areas in the Y direction of the image; vertical and horizontal projections of the text boxes can be used to locate position areas of the Y-axis text boxes in the X direction of the image, and position areas of the X-axis label text boxes in the Y direction of the image. Y-axis text boxes and X-axis label text boxes are determined using the located positions of X-axis and Y-axis. The specific process of classification is as follows:

Before proceeding to the next step, for the data graph where data is to be extracted, all text boxes within the data graph are located by deep learning, and character recognition is performed.

1-1: Once all the text boxes have been obtained, coordinate axis hatch mark value text boxes are identified. Because Y-axis hatch mark value text boxes are distributed on the left or right side of the data graph, the positions of the text boxes are all in a certain interval in the X direction of the image, and the text boxes are evenly arranged along the vertical direction, the present disclosure adopts a vertical projection method, i.e., regardless of the vertical positions of the text boxes, counting the number of text boxes at each X position of the image (in other words, the number of text boxes on each vertical line passing through the positions) and forming an array of numbers of text boxes, in which the number of text boxes at each position in the X direction is stored. Then, a local maximum and a corresponding position are obtained from the array. The array may have multiple local maximums, but most of the corresponding positions may not match the positions of the coordinate axis hatch mark value text boxes and will be deleted. Because coordinate axes are generally located on either side of the data graph, and will not appear in the middle position, the middle position of the data graph can be used as a judgment criterion. In the present disclosure, then, the difference between the average number of text boxes in the middle area in the X direction and the local maximum of the array is obtained and compared with a predetermined threshold; if the difference is larger than the predetermined threshold, it is judged that there is a Y-axis hatch mark value text box; otherwise there is no Y-axis hatch mark value text box. The size of the middle area in the X direction in the array can be adjusted according to actual needs. Generally, the middle area can be obtained by expanding the exact middle of the graph by a certain percentage to each side. Finally, according to the corresponding position of the local maximum, all text boxes at the position are preliminarily determined as the Y-axis hatch mark value text boxes, and a list of Y-axis hatch mark value text boxes is obtained.

In addition, if there are maximums and corresponding positions meeting the criterion on both sides, a list of left Y-axis hatch mark value text boxes and a list of right Y-axis hatch mark value text boxes are obtained.
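The vertical projection of step 1-1 may be illustrated by the following minimal sketch. The box layout (x_min, y_min, x_max, y_max), the function names, the 20% middle area and the excess threshold of 2 are assumptions chosen for illustration only. A flat run of equal counts is treated as a single local maximum reported at its center.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical layout: (x_min, y_min, x_max, y_max)


def project_boxes_on_x(boxes: List[Box], image_width: int) -> List[int]:
    """Count, for every X position, how many text boxes cover that column."""
    counts = [0] * image_width
    for x_min, _, x_max, _ in boxes:
        for x in range(max(0, x_min), min(image_width, x_max + 1)):
            counts[x] += 1
    return counts


def y_axis_label_positions(counts: List[int],
                           middle_fraction: float = 0.2,
                           min_excess: float = 2.0) -> List[int]:
    """Return X positions of local maxima that exceed the middle-area average."""
    width = len(counts)
    centre = width // 2
    half = max(1, int(width * middle_fraction / 2))
    middle = counts[max(0, centre - half):min(width, centre + half)]
    middle_avg = sum(middle) / len(middle)

    peaks = []
    x = 0
    while x < width:
        end = x
        while end + 1 < width and counts[end + 1] == counts[x]:
            end += 1  # extend over a plateau of equal counts
        left = counts[x - 1] if x > 0 else -1
        right = counts[end + 1] if end + 1 < width else -1
        is_local_max = counts[x] > left and counts[x] > right
        if is_local_max and counts[x] - middle_avg > min_excess:
            peaks.append((x + end) // 2)
        x = end + 1
    return peaks


# Four Y-axis value boxes stacked on the left, three X-axis labels below the plot.
boxes = [(5, 10, 20, 20), (5, 40, 20, 50), (5, 70, 20, 80), (5, 100, 20, 110),
         (40, 130, 60, 140), (90, 130, 110, 140), (140, 130, 160, 140)]
peaks = y_axis_label_positions(project_boxes_on_x(boxes, 200))
```

All text boxes that cover a returned position would then form the preliminary Y-axis hatch mark value text box list.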

1-2: In order to remove mistakes where text boxes such as graph titles and copyright declarations are identified as Y-axis hatch mark value text boxes, the consistency of spacing between text boxes is used to identify and remove text boxes that do not meet the consistency requirements. The text box spacing consistency test is realized by an independent method (noise data filtering, explained in the next step). In this step, text box spacing is the filtering condition.

1-3: The noise data filtering method is based on the observation that most of the numbers in the array differ only slightly from one another, while a few differ greatly. Specifically, this includes: comparing all the data to be filtered in pairs, to find a data pair with the smallest value difference corresponding to the filter condition; if the value difference meets an error requirement, calculating the average of the data pair and determining the average of the data pair as a standard value; if the value difference does not meet the error requirement, determining that there is no number meeting the consistency requirements; then, calculating a difference between each of the rest of the data to be filtered and the standard value, determining a ratio of the difference to the standard value as a basis for measurement, and filtering according to a threshold, where the filtering removes data that exceeds the threshold.

Specifically, the filtering condition can be determined according to actual needs. For example, in step 1-2, text box spacing is used as the filter condition, so the spacing between adjacent text boxes is to be calculated. Other filtering conditions are similar.
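A minimal sketch of the noise data filtering method of step 1-3 is given below. The function name and the two tolerance values are hypothetical, and expressing the error requirement for the closest pair as a relative difference is an assumption; the ratio test against the standard value follows the description above.

```python
from itertools import combinations
from typing import List


def filter_noise(values: List[float],
                 pair_tolerance: float = 0.1,
                 ratio_threshold: float = 0.2) -> List[float]:
    """Keep the values that agree with a standard value derived from the closest pair."""
    if len(values) < 2:
        return []
    # Find the pair with the smallest difference among all pairs.
    a, b = min(combinations(values, 2), key=lambda p: abs(p[0] - p[1]))
    standard = (a + b) / 2
    # Assumed error requirement: the closest pair must agree within pair_tolerance.
    if standard == 0 or abs(a - b) / abs(standard) > pair_tolerance:
        return []  # no value meets the consistency requirement
    # Remove every value whose relative deviation from the standard is too large.
    return [v for v in values if abs(v - standard) / abs(standard) <= ratio_threshold]


# Spacings between adjacent text boxes; 85 comes from a wrongly included title box.
consistent = filter_noise([30, 31, 29, 30, 85])
```

Because every pair is compared, the cost grows quadratically with the number of values, which is acceptable for the short lists of text boxes or hatch marks found in a single graph.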

1-4: Similar to steps 1-1 to 1-3, horizontal projection is used to obtain a list of X-axis label text boxes.

Specifically, this is done by swapping X and Y, that is:

counting the number of text boxes at each Y position of the image, to obtain an array storing the number of text boxes at each position in the Y direction. The number of text boxes at each position in the Y direction can also be obtained by counting the number of text boxes on a horizontal line passing through the position. Then, a local maximum and a corresponding position are obtained from the array; the difference between the average number of text boxes in the middle area in the Y direction and the local maximum of the array is obtained and compared with a threshold; if the difference is within the threshold, it is judged that there is an X-axis hatch mark value text box at the corresponding position of the local maximum. Finally, according to the corresponding position of the local maximum, all text boxes at the position are preliminarily determined as the X-axis hatch mark value text boxes, and a list of X-axis hatch mark value text boxes is obtained. Nonconforming text boxes are removed by the noise filtering of step 1-3.

1-5: Graph title text in a graph title text box is identified according to size characteristics of the graph title text box and positional distribution characteristics in the data graph. In the present disclosure, the graph title can be identified according to width or height of the text box, or whether the text box is above or below the data graph.

2. Locating of Coordinate Axes, and Locating of the Positions of Hatch Marks on the Coordinate Axes

Locating of coordinate axes, and locating of the positions of hatch marks on the coordinate axes refer to determining the positions of the Y-axis and X-axis and determining the corresponding hatch mark value text box lists, and locating the positions of the hatch marks on the coordinate axes. The specific process is as follows.

2-1: Locating of coordinate axes is performed on the data graph, which includes the following steps:

2-1-1: calculating a horizontal gradient and a vertical gradient of the image, respectively;

2-1-2: determining edge pixels according to the horizontal gradient or vertical gradient result and a threshold;

2-1-3: counting the number of consecutive edge pixels in each column (horizontal gradient result) and the number of consecutive edge pixels in each row (vertical gradient result); and determining an edge pixel column whose number of consecutive edge pixels exceeds a set threshold as a candidate Y-axis, and determining an edge pixel row whose number of consecutive edge pixels exceeds a set threshold as a candidate X-axis;

2-1-4: setting a minimum distance threshold between coordinate axes, and merging adjacent candidate coordinate axes whose distance is less than the distance threshold, to solve the problem that one line may produce two edge lines;

2-1-5: determining the coordinate axis and the axis hatch mark value text box lists according to the positional relationship between the candidate coordinate axes and candidate axis hatch mark value text box lists. Here, the positional relationship mainly includes: left Y-axis text boxes are to the left of the left Y-axis and have the same height; right Y-axis text boxes are to the right of the right Y-axis and have the same height; X-axis label text boxes are below the X-axis and have the same width.
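Steps 2-1-1 to 2-1-4 may be sketched for the Y-axis as follows, using a grayscale image stored as a list of rows. The simple forward difference used as the horizontal gradient, the function names and all threshold values are assumptions for illustration; an actual embodiment may use any gradient operator.

```python
from typing import List, Optional


def longest_run(flags: List[bool]) -> int:
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def candidate_y_axes(gray: List[List[int]],
                     grad_threshold: int = 50,
                     min_run: Optional[int] = None,
                     merge_distance: int = 3) -> List[int]:
    """Return X positions of candidate Y-axes in a grayscale image."""
    height, width = len(gray), len(gray[0])
    if min_run is None:
        min_run = height // 2  # assumed: an axis spans more than half the image

    columns = []
    for x in range(width - 1):
        # Horizontal gradient as a forward difference, thresholded into edge flags.
        edges = [abs(gray[y][x + 1] - gray[y][x]) > grad_threshold
                 for y in range(height)]
        if longest_run(edges) > min_run:
            columns.append(x)

    # One dark line yields two edge columns (left and right side): merge them.
    groups: List[List[int]] = []
    for x in columns:
        if groups and x - groups[-1][-1] < merge_distance:
            groups[-1].append(x)
        else:
            groups.append([x])
    return [sum(group) // len(group) for group in groups]


# White 20x30 image with a black vertical line at x = 5 from row 2 to row 18.
image = [[255] * 30 for _ in range(20)]
for row in range(2, 19):
    image[row][5] = 0
axes = candidate_y_axes(image)
```

Candidate X-axes can be obtained from the same function by passing the transposed image, since rows and columns then exchange roles.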

2-2: Locating the positions of hatch marks on X-axis and Y-axis:

First, an axis area image is extracted, where the axis is at the center and the width of the area image in a direction perpendicular to the axis covers the entire axis and hatch marks on the axis. Then, the axis area image is binarized, where the coordinate axis and the hatch marks on the coordinate axis are the foreground. Then, the number of foreground pixels of the binarized image in a direction perpendicular to the axis is counted in a row-by-row or column-by-column manner; whether it is row-by-row or column-by-column depends on which axis is to be located. Then, the local maximums of the array from the counting are obtained and determined as the positions of candidate hatch marks. Finally, the obtained candidate hatch mark positions are filtered by the noise data filtering method of step 1-3, to obtain actual hatch marks on each coordinate axis.

Specific steps include:

2-2-1: locating the positions of Y-axis hatch marks: first, extracting an image of a certain width with the X coordinate of the Y-axis as the center; then binarizing; then horizontally counting the binarized image; then, obtaining a local maximum and determining it as the position of a candidate hatch mark.

2-2-2: locating the positions of X-axis hatch marks: first, extracting an image of a certain width with the Y coordinate of the X-axis as the center; then binarizing; then vertically counting the binarized image; then, obtaining a local maximum and determining it as the position of a candidate hatch mark.

2-2-3: filtering the obtained candidate hatch marks by the noise data filtering method of step 1-3.

3. Legend Locating and Information Extraction

Legend locating and information extraction include obtaining, filtering, combining, and verifying legend connected components, and calculating and identifying color of the legend, and the legend text box corresponding to the legend after the legend locating. Specifically, these include the following steps:

3-1: performing connected component analysis by calculating color distances between adjacent pixels, to find all connected components with similar colors in the data graph; obtaining an average color value (e.g., RGB color average) for each connected component, as the color of the connected component; and counting the number of pixels in the connected component and bounding rectangles.

3-2: filtering all the connected components according to the height, width, number of pixels, aspect ratio and compactness (compactness = number of connected component pixels / area of connected component bounding frame) and a threshold, to obtain candidate legends meeting a legend requirement. Generally, in the filtering, any connected component that does not meet the filtering condition should be deleted.

3-3: scanning all possible candidate legend connected components in pairs, so that two connected components meeting color and height consistency requirements (i.e., the color difference between the two legends is less than a threshold, and the height difference is also less than a threshold) are combined into a new candidate legend;

3-4: determining a data area of the data graph according to the X-axis and Y-axis determined previously. When the data area has been determined, the remaining area in the data graph can be divided into four parts, in relation to the data area: above, below, left and right. Steps 3-1 to 3-3 are performed on the three areas above, right and below in relation to the data area respectively, to obtain all candidate legends in these three areas. According to the number of candidate legends in each of the three areas, the candidate legends in the area with the largest number of candidate legends are selected as the actual legends of the data graph.

3-5: performing layout analysis on the obtained actual legends according to the spatial positions of the legends, to determine whether the legends in the data graph are arranged in a vertical, horizontal, or hybrid layout; and filtering out legends nonconforming with the layout;

3-6: according to the layout of the legends, searching for a corresponding legend text box for each legend from the data graph, and identifying text characters and character color in each legend text box.

4. Extracting Corresponding Bar or Polyline Connected Components According to Legend Color, and Filtering and Classification

Extracting corresponding bar or polyline connected components according to legend color and filtering and classification refer to obtaining connected components with the same color as the legends from the data graph according to the color of the legends, and filtering out some noise connected components (smaller ones) according to a threshold; then identifying whether the connected components corresponding to the legends are bars or a polyline according to the aspect ratio and compactness; finally, obtaining the position, length and width of the bars and positional information of the foreground pixels. In the present disclosure, foreground pixels (pixels of the bars and the polyline) are extracted according to the color of the legends and by a nearest neighbor method.

Specifically, the process includes the following steps:

4-1: combining background color, character color in the text and legend color into a color list according to the background of the graph, characters in the text and legends obtained by the previous steps; scanning the pixels in the data area, and assigning each pixel the class of the color in the color list that is closest to the pixel color. Here, if there is no legend in the data graph, pixels in the data area that are neither background color nor character color are extracted as a foreground class.

4-2: performing connected component analysis on pixels of each class, filtering the connected components according to a threshold, to filter out connected components with fewer pixels than the threshold, and obtaining a corresponding set of connected components for each legend in the data area.

4-3: based on the height, width, number of pixels and compactness of the connected components, scanning all the connected component sets according to a threshold, to determine for each set whether the connected component is a bar, and if it is a bar, calculating the variance of the heights of all bars and the variance of the widths of all the bars in the graph, and determining whether the bars in the bar chart are horizontal or vertical according to the variances. The widths of the bars in the same chart are usually the same, but the lengths vary; therefore, the variances reflect the direction in which the bars are. Then, the width of the bars is calculated (if the bars are horizontal, the average height of the connected components is calculated). If there is no bar, it is determined that the data graph is a line chart, and a line chart is a data graph with a vertical layout.

4-4: according to the layout direction type (vertical layout or horizontal layout) of the data graph, identifying for each legend whether a connected component set corresponding to the legend is a bar or a polyline, and determining a classification axis and a numerical axis in the data graph. The classification axis and numerical axis in the data graph are determined by: when the data graph is in a vertical layout, determining the X-axis as the classification axis and the Y-axis as the numerical axis; when the data graph is in a horizontal layout, determining the Y-axis as the classification axis and the X-axis as the numerical axis.

4-5: for all the connected components corresponding to the legends identified as bars, selecting a bar whose width meets the bar width described in step 4-3 as a candidate bar for the legend, and analyzing the spatial positions and distances of all the bars to identify whether there is a bar divided by a polyline into two connected components, and if there is, recombining them into one;

4-6: for all the connected components corresponding to the legends identified as a line, obtaining a sequence of points on the line corresponding to the X axis, i.e., data points on the polyline vertically corresponding to the positions of the X axis, and if there are multiple data points corresponding to the same position, selecting a representative point (e.g., an average point) to remove extra points corresponding to the X-axis.

5. Determining Key Points on the X-Axis and Locating a Corresponding X-Axis Label for Each Key Point

The specific process includes the following steps:

5-1: if there are hatch marks on the classification axis, sorting them based on their positions according to the obtained hatch marks on the classification axis; and determining the middle point of two adjacent hatch marks as a classification axis key point;

5-2: if there is no hatch mark on the classification axis, determining the middle point between classification axis hatch mark value text boxes as a classification axis key point; and filtering the obtained classification axis key points by the noise data filtering method of step 1-3.

5-3: in other extreme cases, if there are no hatch marks and no classification axis label text box, the number of legends is less than 3, and there are bars in the data graph, determining the middle position on the bottom of the classification axis as the key point position.
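
By way of non-limiting illustration, step 5-1 may be sketched in Python as follows. The function name key_points_from_ticks and the sample pixel positions are hypothetical and serve only to show the midpoint rule; they are not part of the disclosure.

```python
def key_points_from_ticks(tick_positions):
    """Return the midpoints between adjacent hatch marks on the classification axis."""
    ticks = sorted(tick_positions)  # step 5-1 sorts the hatch marks by position
    return [(a + b) / 2 for a, b in zip(ticks, ticks[1:])]


# Hypothetical hatch mark positions (in pixels) along the classification axis.
key_points = key_points_from_ticks([300, 100, 200, 400])
print(key_points)
```

Each returned value is the position of one classification axis key point; n hatch marks yield n-1 key points.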

6. Locating key points of the bars and polyline according to the X-axis key points, determining labeled numerical text boxes that correspond to the key points, and identifying the numerical text. The specific process includes the following steps:

6-1: determining key data points of the bars or polyline respectively, where the key data point of a vertical bar is the middle point of the top edge of the bar, and the key data point of a horizontal bar is the middle point of the far right of the bar, and the key data points of a polyline are the data points on the polyline vertically corresponding to the key points of the classification axis;

6-2: according to the position of each key data point, the layout of the data graph and the positions of the text boxes in the graph, searching for a corresponding labeled numerical text box for each key data point by means of distance and setting a threshold;

6-3: identifying a labeled numerical value in each labeled numerical text box by a digital recognition engine.

7. Calculating a Corresponding Value for Each Pixel, and Estimating Corresponding Values of the Key Points of the Bars or Polyline

Calculating a corresponding value for each pixel, and estimating corresponding values of the key points of the bars or polyline refer to estimating a corresponding value for each pixel according to the obtained Y-axis hatch mark value text boxes, the positions of the Y-axis hatch marks, the height of the bars and the corresponding text box values. When the corresponding value of each pixel has been obtained, estimated values of the key point are calculated according to the obtained coordinate axes and the positions of the key points. In the present disclosure, the estimation of the corresponding value of each pixel is mainly by locating the positions of the coordinate axis hatch marks. Specifically, this includes the following steps:

7-1: matching according to the positional relationships between the hatch marks on the numerical axis and the labeled numerical text boxes on the numerical axis, and identifying the values in the labeled numerical text boxes on the numerical axis;

7-2: for any two adjacent hatch marks on the numerical axis, calculating a corresponding value for each pixel according to the number of pixels between the two hatch marks and the difference between the values in the corresponding labeled numerical text boxes, where the calculated values form a single pixel corresponding value list;

7-3: filtering out noise from the single pixel corresponding value list by the noise data filtering method;

7-4: calculating the average value of the single pixel corresponding value list after the noise filtering, as a final corresponding value M of the single pixel;

7-5: according to the obtained corresponding value M of the single pixel and the bar height H of the key data points, calculating an estimated value for each key data point, where the bar height H of a data graph in a vertical layout is the distance from the key data point to the X-axis, and the bar height H of a data graph in a horizontal layout is the distance from the key data point to the left Y axis.

This step also includes: determining whether the numerical axis is the X-axis or Y-axis according to the layout of the graph.
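
By way of non-limiting illustration, steps 7-2, 7-4 and 7-5 may be sketched in Python as follows. The function names, the hatch mark positions and the bar height are hypothetical. The noise filtering of step 7-3 is omitted for brevity, and the estimate M*H assumes that the numerical axis reads zero at the baseline from which H is measured; the disclosure does not state this explicitly.

```python
def value_per_pixel(tick_positions, tick_values):
    """Average value represented by one pixel along the numerical axis."""
    pairs = sorted(zip(tick_positions, tick_values))
    ratios = []
    for (p0, v0), (p1, v1) in zip(pairs, pairs[1:]):
        if p1 != p0:
            # Step 7-2: value difference divided by pixel distance.
            ratios.append(abs(v1 - v0) / abs(p1 - p0))
    if not ratios:
        raise ValueError("at least two distinct hatch marks are required")
    # Step 7-4: average of the per-pixel values (step 7-3 filtering omitted).
    return sum(ratios) / len(ratios)


def estimate_value(m, bar_height):
    """Step 7-5 under the assumption that the axis value at the baseline is 0."""
    return m * bar_height


# Hypothetical Y-axis hatch marks: pixel rows and the values recognized next to them.
M = value_per_pixel([400, 360, 320, 280], [0, 10, 20, 30])
estimate = estimate_value(M, 100)  # key data point 100 pixels above the X-axis
print(M, estimate)
```

In this example every 40 pixels correspond to a value difference of 10, so one pixel corresponds to 0.25 and a 100-pixel bar is estimated at 25.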

8. Determining a Final Result According to a Difference Between the Estimated Values and the Recognized Labeled Values, and Storing

For each key data point in the data graph, comparing the labeled value obtained by step 6-3 with the estimated value obtained by step 7-5, and if within an error range, determining the recognition result correct, and determining the labeled value as the value of the key point; otherwise, determining the estimated value as the value of the key point.

Specifically, in this step, error estimation is performed on the estimated value (est_val) obtained at the key point of the bar or polyline and the recognized result (reco_val) of the labeled value obtained from key point scanning, i.e., error = 2*abs(est_val - reco_val)/(est_val + reco_val), and if error is less than a given threshold, determining the recognition result correct and determining the recognition result of the labeled value as the value of the key point; otherwise, determining the estimated value as the value of the key point.
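
By way of non-limiting illustration, the comparison above may be sketched in Python as follows. The function name choose_value, the default threshold of 0.1 and the handling of a missing or zero-sum case are assumptions made for the example; the disclosure only specifies the error formula and "a given threshold". The formula is used as written and therefore presumes positive values.

```python
def choose_value(est_val, reco_val, threshold=0.1):
    """Pick the recognized label if it agrees with the pixel-based estimate."""
    # Assumption: fall back to the estimate when no label was recognized
    # or when the denominator of the error formula would be zero.
    if reco_val is None or est_val + reco_val == 0:
        return est_val
    error = 2 * abs(est_val - reco_val) / (est_val + reco_val)
    return reco_val if error < threshold else est_val


print(choose_value(25.0, 25.3))   # label agrees with the estimate
print(choose_value(25.0, 253.0))  # e.g. a decimal point lost during recognition
```

In the first call the relative error is about 0.012, so the recognized label is kept; in the second call the error is about 1.64, so the estimate is used instead.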

Various elements and values finally identified in the above step can be stored to an Excel table, for easy viewing. The steps 1 to 8 will be described further with reference to embodiments, so that those skilled in the art can better understand the implementation of the present disclosure. In the following embodiment, since each step has been described in detail, parts of the implementation and principles are omitted.

EMBODIMENTS

Embodiments of the present disclosure will be further described below in conjunction with a method flowchart. FIG. 1 illustrates exemplary elements of a data graph, for a better understanding of the terminology of the present disclosure. FIG. 2 is a flow chart of a method according to the present disclosure, showing the relationships between the steps of the present disclosure and its flow. FIG. 3 shows an input original data graph, sized 932*527. The figure includes bars and a polyline, in a vertical layout. There are two legends, an X-axis, and two Y-axes on the left and right respectively.

For this data graph, the method for automatic extraction of data includes the following steps.

1. Text Area Locating and Text Box Classification in a Data Graph

There are many text detection methods in the field of deep learning, such as: EAST: An Efficient and Accurate Scene Text Detector. In this embodiment, a text detection method based on a CNN+LSTM model (Detecting Text in Natural Image with Connectionist Text Proposal Network) is used for the text region locating. For the processed data graph, this embodiment collects some samples and retrains the model; the obtained result of detection by the trained model is shown in FIG. 4. The accuracy of text locating is high; however, because there are many line and text-like foreground targets, some errors do occur, including large gaps between the height of the located text box and the height of the characters.

Upon locating the positions of the text boxes, the text boxes are classified, to facilitate further information extraction. The text boxes are classified mainly according to a positional relationship of the text box and relationships between the text box and other objects. In the embodiment, the text box may be any of the 6 types: Y-axis hatch mark value text box (left or right), X-axis label text box, graph title, labeled numerical text box, and other text boxes. Y-axis hatch mark value text boxes are arranged vertically, distributed in the same horizontal position area, and the number is large. X-axis label text boxes are basically distributed in the same height area, i.e., arranged horizontally, and the number is large. The other text boxes are highly randomly distributed according to the height of the bar or the polyline, and exhibit no consistency. Therefore, classification can be performed by the positional relationships and distribution patterns of the text boxes. The specific process of classification is as follows:

Because Y-axis hatch mark value text boxes are distributed on the left or right side of the data graph, the positions of the text boxes are all in a certain interval in the X direction of the image, and the text boxes are evenly arranged along the vertical direction, this embodiment adopts a vertical projection method, i.e., regardless of the vertical positions of the text boxes, counting the number of text boxes at each X position of the image and forming an array of numbers of text boxes. Then, a local maximum and a corresponding position are obtained from the array. FIG. 5 shows a result from vertical projection of the text boxes according to the embodiment. In FIG. 5, the Y-axis is the number of text boxes, and the X-axis is the X position in the graph [0, image width]. It can be seen that maximum values correspond to these horizontal positions: [89, 108], [177, 192], [248, 261], [415, 425], [568, 598], [601, 668], [697, 749], [818, 893].
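
By way of non-limiting illustration, the vertical projection and the search for local maxima may be sketched in Python as follows. The box format (left, top, right, bottom), the function names and the sample boxes are hypothetical; in particular, treating a flat run of equal counts as one maximum interval is an assumption made so that intervals comparable to those listed above are produced.

```python
def vertical_projection(boxes, image_width):
    """Count, for every X position, how many text boxes cover that position."""
    counts = [0] * image_width
    for left, _top, right, _bottom in boxes:
        for x in range(max(left, 0), min(right, image_width - 1) + 1):
            counts[x] += 1
    return counts


def peak_intervals(counts):
    """Return (start, end, count) for each plateau higher than both neighbours."""
    peaks = []
    n = len(counts)
    x = 0
    while x < n:
        end = x
        while end + 1 < n and counts[end + 1] == counts[x]:
            end += 1
        left_val = counts[x - 1] if x > 0 else -1
        right_val = counts[end + 1] if end + 1 < n else -1
        if counts[x] > 0 and counts[x] > left_val and counts[x] > right_val:
            peaks.append((x, end, counts[x]))
        x = end + 1
    return peaks


# Hypothetical boxes: three stacked vertically on the left, one isolated box.
boxes = [(10, 5, 30, 15), (12, 25, 28, 35), (15, 45, 29, 55), (60, 80, 90, 90)]
peaks = peak_intervals(vertical_projection(boxes, 100))
print(peaks)
```

Each peak gives an X interval and the number of text boxes stacked over it; intervals with a large count are the natural candidates for Y-axis hatch mark value text boxes.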

Next, all text boxes including the position of the local maximum are determined as Y-axis hatch mark value text boxes, according to the position of the local maximum. That is, those text boxes having a left border smaller than the position of the maximum and a right border larger than the position of the maximum are determined as the Y-axis numerical text boxes. Then, the differences between the number of maximum text boxes in the vertical projection of the middle area and the local maximum are obtained and compared with a given threshold; if a difference is larger than the threshold, it is determined that there is a Y-axis hatch mark value text box. In this figure, the number of maximums in the middle area is 3, the number of maximums in the left area is 8 (i.e., 8 text boxes are arranged vertically) and the number of maximums in the right area is 10. By comparing the numbers of maximums in the left and right areas and the middle area, it is determined that there are left Y-axis and right Y-axis numerical text boxes.

In order to remove mistakes where text boxes such as graph titles and copyright declarations are identified as Y-axis hatch mark value text boxes, in this embodiment, the consistency of spacing between text boxes is used to identify and judge, to remove text boxes that do not meet the consistency requirements. The spacing between text boxes is the difference between the heights of the centers of adjacent text boxes. 10 text boxes will result in 9 spacings. The text box spacing consistency test is realized by an independent method (noise data filtering). The method is based on a recognized fact that most of the numbers in the array have a small difference, while some have a large difference. Specifically, this includes: comparing the data in pairs, to find a data pair with the smallest difference, and determining whether the distance meets a requirement, and if so, calculating the average of the data pair and determining the average of the data pair as a standard value; if not, determining that there is no number meeting the consistency requirements; then, calculating a difference between each of the rest of the data and the standard value, determining a ratio of the difference to the standard value as a basis for measurement, and filtering according to a threshold. The threshold used in this embodiment is 0.1.
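
By way of non-limiting illustration, the noise data filtering method may be sketched in Python as follows. The function name filter_noise and the sample spacings are hypothetical. The disclosure does not state which requirement the closest pair must meet; the sketch assumes that the relative difference of the pair must not exceed the same ratio threshold of 0.1.

```python
from itertools import combinations


def filter_noise(values, ratio_threshold=0.1):
    """Keep only the values that are consistent with the closest pair."""
    if len(values) < 2:
        return list(values)
    # Find the pair of values with the smallest difference.
    a, b = min(combinations(values, 2), key=lambda p: abs(p[0] - p[1]))
    standard = (a + b) / 2.0
    # Assumed requirement: the closest pair itself must be consistent.
    if standard == 0 or abs(a - b) / abs(standard) > ratio_threshold:
        return []  # no values meet the consistency requirement
    # Keep values whose relative deviation from the standard is small enough.
    return [v for v in values if abs(v - standard) / abs(standard) <= ratio_threshold]


# Hypothetical spacings between 10 text boxes; 83 and 12 are outliers.
spacings = [40, 41, 40, 39, 40, 83, 40, 41, 12]
print(filter_noise(spacings))
```

Here the standard value is 40, so spacings between 36 and 44 are kept and the two outliers are removed. The same routine can be applied to candidate hatch mark spacings and to the single pixel corresponding value list.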

The classification method for the Y-axis numerical text boxes above can be used again, with little modification to a horizontal projection method, i.e., the numbers of text boxes are projected to the Y direction of the image, to obtain local maximums, and further obtain an X-axis label text box list. The result from the text box classification is shown in FIG. 6. The boxes on the left are identified as left Y-axis hatch mark value text boxes, the boxes on the right are identified as right Y-axis hatch mark value text boxes, and boxes at the bottom are identified as X-axis label text boxes.

The graph title can be identified according to the width or height of the text box, or whether the text box is above or below the data graph. Because a graph title generally contains a larger number of characters, the width of its text box is also larger than that of other text boxes; moreover, a graph title generally has a font size larger than that of other text boxes, and correspondingly the height of the text box of the graph title is larger than that of other text boxes. In addition, the graph title is in an upper or lower position of the graph. Therefore, similar to the classification of Y-axis numerical text boxes, in the present disclosure, the whole graph is divided into three parts: upper, middle, and lower, and a maximum height and a maximum width of text boxes in each part are obtained. If there is a text box in the upper or lower part of the graph having a height larger than the sum of the maximum height of text boxes in the middle part plus a threshold, and the text box in the upper or lower part of the graph has a width larger than the sum of the maximum width of text boxes in the middle part plus a threshold, it is determined that the text box is a graph title box.

2. Locating of Coordinate Axes, and Locating of the Positions of Hatch Marks on the Coordinate Axes

(1) Locating Coordinate Axes, and Determining Corresponding Hatch Mark Value Text Box Lists

Locating of coordinate axes refers to determining the positions of the Y-axis and X-axis, i.e., determining the positions of the two endpoints (x1, y1) and (x2, y2) of the line segments in the graph coordinate system. Because the coordinate axes are either vertical (Y-axis) or horizontal (X-axis), x1 and x2 of the two ends of the Y-axis are equal, and y1 and y2 of the two ends of the X-axis are equal. Many existing edge detection methods can be used to detect the coordinate axes, e.g., Canny. Because the Y-axis in the data graph is vertical, its edges are mainly composed of gradients in the horizontal direction; the X axis is horizontal, so its edges are mainly composed of gradients in the vertical direction. Therefore, this embodiment calculates horizontal and vertical gradients of the image respectively, determines edge pixels according to the horizontal or vertical gradient result and a threshold, counts the number of consecutive edge pixels in each column (horizontal gradient result) or each row (vertical gradient result), and determines candidate coordinate axes according to a threshold. In this embodiment, the threshold for the number of consecutive edge pixels for the Y-axis is defined as image height*0.5, and a run of consecutive edge pixels whose count is larger than the threshold is a candidate coordinate axis; the threshold for the number of consecutive edge pixels for the X-axis is defined as image width*0.5. In the grayscale image, pixels with a horizontal or vertical gradient larger than 15 are considered edge pixels.
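
By way of non-limiting illustration, the search for candidate Y-axes may be sketched in Python as follows. The function names and the small synthetic image are hypothetical. The gradient is taken as a simple forward difference between horizontally adjacent pixels, which is an assumption; the embodiment does not name a specific gradient operator.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def candidate_y_axes(gray, grad_threshold=15, length_ratio=0.5):
    """Columns whose longest vertical run of edge pixels exceeds height*ratio."""
    height, width = len(gray), len(gray[0])
    candidates = []
    for x in range(width - 1):
        # Edge pixels: horizontal gradient magnitude larger than the threshold.
        edge = [abs(gray[y][x + 1] - gray[y][x]) > grad_threshold
                for y in range(height)]
        if longest_run(edge) > height * length_ratio:
            candidates.append(x)
    return candidates


# Hypothetical 10x10 white image with a black vertical line in column 4.
gray = [[255] * 10 for _ in range(10)]
for y in range(1, 9):
    gray[y][4] = 0
print(candidate_y_axes(gray))
```

The single line yields two adjacent candidate columns, one for each side of the line. Candidate X-axes can be found in the same way by exchanging the roles of rows and columns.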

Since one coordinate axis may produce two edge lines, in the embodiment a distance threshold is used to merge two adjacent candidate coordinate axes into one: for the Y-axis, determining whether the difference between the x1 coordinates of the two candidate coordinate axes is less than a given threshold; for the X-axis, determining whether the difference between the y1 coordinates of the two candidate X axes is smaller than a given threshold. In this embodiment, the threshold is set to 5. When a coordinate axis is merged, the intermediate value is taken as a new x1 (for Y-axis) or a new y1 (for X-axis). In this embodiment, the result of merging candidate Y-axes is shown in FIG. 7, where the vertical lines are the located candidate Y-axes.
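
By way of non-limiting illustration, the merging of adjacent candidate axes may be sketched in Python as follows. The function name merge_axes and the sample positions are hypothetical. Grouping is chained (each candidate is compared with the previous candidate in its group), and the "intermediate value" is taken as the midpoint of the group; both choices are assumptions.

```python
def merge_axes(positions, min_distance=5):
    """Merge candidate axis positions that lie closer than min_distance."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < min_distance:
            groups[-1].append(pos)  # same physical line, second edge
        else:
            groups.append([pos])
    # Use the midpoint of each group as the merged axis position.
    return [(group[0] + group[-1]) / 2 for group in groups]


# Hypothetical x1 coordinates of candidate Y-axes.
print(merge_axes([98, 101, 400, 830, 832]))
```

The pairs (98, 101) and (830, 832) collapse to one axis each, while the isolated candidate at 400 (for example, the edge of a bar) is kept for the later positional check against the text box lists.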

As can be seen from the obtained candidate Y-axis, the edge of a bar is also considered a Y-axis. Therefore, the positions of the candidate coordinate axes and their relationships with candidate coordinate axis hatch mark value text box lists are used to determine the coordinate axes and the coordinate axis hatch mark value text box lists. The positional relationships mainly include: left axis text boxes are to the left of the left Y-axis and have the same height; right axis text boxes are to the right of the right Y-axis and have the same height; X-axis label text boxes are below the X-axis and have the same width. Noise coordinate axes can be well removed by the constraint positional relationship with the coordinate axis hatch mark text boxes.

(2) Locating the Positions of Hatch Marks on the Coordinate Axes

Coordinate axis hatch marks refer to small black dots on the axes. Locating their positions is mainly used to calculate the corresponding value of each pixel. In this embodiment, the positions of Y-axis hatch marks are taken as an example. The locating of X-axis hatch marks is similar to that of the Y-axis.

First, an image with a width of 15 pixels and a height of [y1, y2], centered at x1 of the Y-axis ((x1, y1), (x2, y2)), is extracted; then, the image is binarized. Because hatch marks are protruding dots, there are more foreground pixels at their positions. Therefore, the number of foreground pixels of the binarized image is counted horizontally. Then local maximums are obtained and determined as the positions of the candidate hatch marks. Finally, the obtained candidate hatch mark positions are filtered by a noise data filtering method. Intermediate results of the calculation are shown in FIG. 8, where the drawings from left to right are: extracted coordinate axis, result from binarization, result from horizontal projection of foreground pixels, and result from the locating.

The locating of X-axis hatch marks is similar to that of the Y-axis, which includes: extracting an image of a certain height centered at the Y coordinate (y1) of the X-axis; binarizing the image; counting the binarized image vertically; then, obtaining local maximums as the positions of the candidate hatch marks. Finally, a noise data filtering method is used to filter the obtained candidate hatch marks.
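
By way of non-limiting illustration, the projection and local maximum search for Y-axis hatch marks may be sketched in Python as follows. The function names and the small binarized strip are hypothetical; a position is treated as a local maximum when its count is larger than the count before it and not smaller than the count after it, which is an assumed definition.

```python
def row_projection(binary_strip):
    """Count the foreground pixels (value 1) in each row of the strip."""
    return [sum(row) for row in binary_strip]


def local_maxima(profile):
    """Indices whose count rises above the previous row and does not rise after."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]]


# Hypothetical binarized strip around a Y-axis: the axis is column 3 and
# hatch marks protrude to the right in rows 2, 6 and 10.
strip = [[0, 0, 0, 1, 0, 0, 0] for _ in range(13)]
for y in (2, 6, 10):
    strip[y][4] = strip[y][5] = 1

ticks = local_maxima(row_projection(strip))
print(ticks)
```

For X-axis hatch marks the same functions can be applied to the transposed strip, so that the counting is performed column by column. The resulting candidates would then be passed to the noise data filtering method.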

3. Legend Locating and Information Extraction

Legend locating and information extraction mainly include obtaining, filtering, combining, and verifying legend connected components, and calculating and identifying color of the legend, and the legend text box corresponding to the legend after the legend locating. Specifically, these include the following steps:

(1) Performing connected component analysis by calculating color distances between adjacent pixels. This process includes calculating the color distances between adjacent pixels using 4-connectivity. In the embodiment, the distance formula for two colors (r1, g1, b1) and (r2, g2, b2) is: distance = abs(r1−r2) + abs(b1−b2) + abs(g1−g2). It is considered connected when the distance is less than a given threshold. Through continuous iterations, connected components with similar colors are found. Once a connected component is labeled, an average RGB value of the connected component is obtained as the color of the connected component, and the number of pixels of the connected component and the bounding rectangle are counted.
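
By way of non-limiting illustration, the color-distance connected component analysis may be sketched in Python as follows. The function names, the threshold of 30 and the small test image are hypothetical. A breadth-first flood fill is used here as one possible way to realize the iterations mentioned above; the embodiment does not prescribe a particular labeling algorithm.

```python
from collections import deque


def color_distance(c1, c2):
    """Sum of absolute channel differences, as in the distance formula above."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def connected_components(image, threshold=30):
    """Label 4-connected regions whose adjacent pixels have similar colors."""
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    components = []
    for sy in range(height):
        for sx in range(width):
            if labels[sy][sx] != -1:
                continue
            index = len(components)
            labels[sy][sx] = index
            queue = deque([(sx, sy)])
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and labels[ny][nx] == -1
                            and color_distance(image[y][x], image[ny][nx]) < threshold):
                        labels[ny][nx] = index
                        queue.append((nx, ny))
            count = len(pixels)
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            components.append({
                # Average color of the component.
                "color": tuple(sum(image[y][x][c] for x, y in pixels) / count
                               for c in range(3)),
                "pixels": count,
                # Bounding rectangle as (left, top, right, bottom).
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
            })
    return components


# Hypothetical 6x4 white image containing a 3x2 red block.
image = [[(255, 255, 255)] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2, 3):
        image[y][x] = (255, 0, 0)
components = connected_components(image)
print(components[1])
```

The second component is the red block, with 6 pixels and bounding rectangle (1, 1, 3, 2); the first is the surrounding white background.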

(2) Filtering the obtained connected components according to height, width, number of pixels, aspect ratio and compactness. The filtering is mainly based on thresholds, to delete connected components that are unlikely to be connected components of the legend. In the embodiment the thresholds are defined as: number of pixels > 16, and width > 1, and height < width*1.5, and width < image width*0.2, and number of pixels/(width*height) > 0.85. Connected components that meet all these conditions are determined as candidate legends.
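
By way of non-limiting illustration, the five conditions above may be combined in Python as follows. The function name is_candidate_legend and the sample component sizes are hypothetical; the thresholds are those stated for the embodiment.

```python
def is_candidate_legend(width, height, pixel_count, image_width):
    """Apply the embodiment's legend thresholds to one connected component."""
    return (pixel_count > 16
            and width > 1
            and height < width * 1.5
            and width < image_width * 0.2
            and pixel_count / (width * height) > 0.85)  # compactness


# Hypothetical components in a 932-pixel-wide graph.
print(is_candidate_legend(30, 10, 290, 932))    # small, compact colored block
print(is_candidate_legend(40, 200, 7900, 932))  # tall bar: fails height < width*1.5
```

A small filled block passes all conditions, whereas a tall bar is rejected by the aspect ratio condition and a thin polyline would be rejected by the compactness condition.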

(3) Because one legend may be represented by multiple connected components, the multiple connected components are combined into one legend. It is assumed here that the color and height of the connected components of the same legend are the same. This step includes: scanning all possible candidate legend connected component pairs, and if the color distance is less than a given threshold and the difference between the center heights of the two connected components is less than a given threshold, determining that the two connected components can be combined. The combined connected components are a new legend. The bounding rectangle of the new legend is composed of the bounding rectangles of the two connected components; the number of pixels is the sum of the pixels of the two connected components; and the color is the average of the colors of the two connected components.

(4) In order to eliminate the effect of the bars and the polyline in the graph, and because most legends are in the three areas of the graph: upper, middle and lower, the legend extraction of the embodiment may include: extracting legends from only the upper, middle and lower areas of the graph respectively; and determining the legends of the whole graph according to the number of legends obtained in respective areas, where the legends in the part with the largest number of legends are selected as the correctly located legends. FIG. 9 is the legend result obtained from the upper area, marked in black. The bounding rectangle (left, top, right, bottom), color (in BGR) and number of pixels of the left legend are [183, 61, 258, 78], [150, 64, 2] and 2589 respectively. The bounding rectangle, color and number of pixels of the right legend are [525, 70, 595, 73], [192, 192, 192] and 492 respectively.

(5) There is generally more than one legend, and the distribution of multiple legends follows a pattern, which mainly includes: vertical layout, horizontal layout and hybrid layout. Determination of the layout of the legends can be used to filter out legends nonconforming with the layout, and to find the text boxes corresponding to the legends. Determining the legend layout mainly includes: comparing in pairs, to put legends with a height difference less than a given threshold into the same array, and scanning all the legends to obtain a list of legend arrays of different heights. If there is only one array in the list, it is determined that the legend layout is horizontal; if there are multiple arrays and each array contains multiple legends, it is determined that the legend layout is hybrid; if each array in the list has only one legend, it is determined that the legend layout is vertical. In this embodiment, the legend layout is identified as horizontal. When the legend layout has been determined, if it is hybrid or vertical, it is determined whether the height between two legends in different rows meets a given threshold, so as to remove some cases where special character connected components (e.g., the Chinese character for the number one, "一") are identified as legends.
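
By way of non-limiting illustration, the layout determination may be sketched in Python as follows. The function name legend_layout and the height threshold of 5 are hypothetical. The case in which some arrays contain one legend and others contain several is not addressed above; the sketch treats it as hybrid, which is an assumption.

```python
def legend_layout(center_ys, height_threshold=5):
    """Classify the legend layout from the vertical centers of the legends."""
    rows = []
    for y in center_ys:
        for row in rows:
            if abs(row[0] - y) < height_threshold:
                row.append(y)  # same height: same array
                break
        else:
            rows.append([y])
    if len(rows) == 1:
        return "horizontal"
    if all(len(row) == 1 for row in rows):
        return "vertical"
    return "hybrid"


# Vertical centers of the two legends of FIG. 9, computed from their
# bounding rectangles [183, 61, 258, 78] and [525, 70, 595, 73].
print(legend_layout([(61 + 78) / 2, (70 + 73) / 2]))
```

The two centers, 69.5 and 71.5, fall into the same array, so the layout is classified as horizontal, which agrees with the result stated for this embodiment.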

(6) According to the layout of the legends, searching for corresponding legend text boxes. The text box corresponding to a legend is generally on the right side of the legend and at the same height. A simple method is to find the text box to the right according to the position of the legend. Because a legend is often a colored line, it can easily be identified as part of the text box by text locating, in which case it is desirable that the legend and the text box corresponding to the legend be re-divided. The embodiment includes: judging whether to re-divide the text box by judging whether the bounding frame of the legend intersects the text box, and if so, dividing the text box into two parts according to left and right borders of the bounding frame of the legend. If a section is too narrow, it is not a valid text box. When the text box corresponding to the legend has been detected, the characters in the text box are recognized by a character recognition engine. In the embodiment, the obtained legend text box is shown in FIG. 10 (color of the box represents color of the corresponding legend). The recognition results are: “Enforcement costs (million yuan)” and “Percentage in operating casts”. In this example, “costs” is recognized as “casts” mainly due to inaccurate text locating.

4. Extracting Corresponding Bar or Polyline Connected Components According to the Legends, and Filtering and Classification

Connected components with the same color as the legends are obtained from the data graph according to the color of the legends, and some noise connected components (smaller ones) are filtered out according to a threshold. Then, it is identified whether the connected components corresponding to the legends are bars or a polyline according to the aspect ratio and compactness. Finally, the position, length and width of the bars and positional information of the foreground pixels are obtained. Specifically, the following steps are included:

(1) Extracting Foreground Pixels

Foreground pixels are the pixels representing the bars or polyline in the image. In order to extract the connected components of the bars or polyline, first foreground pixels in the image are extracted according to the color. Foreground pixels can be extracted using a threshold method. Because there may be multiple foregrounds of different colors, and there may be foregrounds of different classes with similar colors, the embodiment uses a nearest neighbor method that combines background color, character color in the text and legend color into a color list; scans the pixels in the data area, and determines the one closest to the color in the color list as the class of the pixel, and if there is no legend, uses a non-background and non-character color in the data area as a foreground class. Based on legend extraction, the foreground pixel results obtained by the nearest neighbor method in this embodiment are shown in the following figure, where black pixels are non-foreground pixels. It can be seen that many pixels at the edges of characters are recognized as foreground, mainly because the foreground pixels are gray (192, 192, 192), and the edges of characters are blurred due to image compression and the like and color becomes closer to (192, 192, 192).
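
By way of non-limiting illustration, the nearest neighbor classification of a pixel may be sketched in Python as follows. The function name classify_pixel and the assumed white background and black character colors are hypothetical; the two legend colors are those reported for FIG. 9. The sum of absolute channel differences is reused as the distance measure, which is an assumption, since the metric of the nearest neighbor method is not specified.

```python
def classify_pixel(pixel, color_list):
    """Return the index of the color in color_list that is closest to pixel."""
    return min(range(len(color_list)),
               key=lambda i: sum(abs(a - b) for a, b in zip(pixel, color_list[i])))


color_list = [
    (255, 255, 255),  # class 0: background (assumed white)
    (0, 0, 0),        # class 1: character color (assumed black)
    (150, 64, 2),     # class 2: left legend color from FIG. 9 (BGR)
    (192, 192, 192),  # class 3: right legend color from FIG. 9
]

print(classify_pixel((148, 60, 5), color_list))    # close to the left legend color
print(classify_pixel((170, 170, 170), color_list)) # blurred character edge
```

The second call shows why blurred character edges end up in the gray foreground class: a mid-gray pixel is closer to (192, 192, 192) than to either black or white.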

(2) Performing connected component analysis on pixels of each class, to obtain a corresponding set of connected components for each legend. The class here corresponds to the legend. Because connected components composed of edge pixels that are recognized as foreground generally have a small number of pixels, the embodiment filters the connected components according to whether the number of pixels of a connected component meets a threshold requirement. The threshold is set to 30.
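A minimal sketch of this step is given below, using a breadth-first search over a boolean mask of one pixel class. The use of 4-connectivity and the reading of the threshold as "at least 30 pixels" are assumptions, and the function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected regions of True cells; returns lists of (row, col)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def filter_small(components, min_pixels=30):
    """Drop components with fewer than min_pixels pixels (noise)."""
    return [comp for comp in components if len(comp) >= min_pixels]


# 10x10 mask with one 6x6 block (36 pixels) and one 2x2 speck (4 pixels).
mask = [[(r < 6 and c < 6) or (r >= 8 and c >= 8) for c in range(10)]
        for r in range(10)]
kept = filter_small(connected_components(mask))
print(len(kept), len(kept[0]))
```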

(3) Because bars generally have a large width, more pixels, large height and great compactness (number of pixels/area of bounding frame), a threshold method is used to determine whether a connected component is a bar according to the number of pixels, height, width and compactness of the connected component for each of the connected components in the connected component set. If there are bars, the variance of the heights and the variance of the widths are calculated, and it is determined whether the bars are horizontal or vertical according to the variances. Then, the width of the bars is calculated, where if the bars are horizontal, the width of the bars is the average height of the bar connected components; if the bars are vertical, the width of the bars is the average width of the bar connected components. If there is no bar, it is determined that the data graph is a line chart, and a line chart is a data graph with a vertical layout, thus determining whether the data graph is in a horizontal layout or vertical layout.
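The following sketch illustrates the bar test and the layout decision. All threshold values are hypothetical. The disclosure states only that the layout follows from the two variances, so the rule used here (vertical bars share a width, so the width variance is the smaller one) is an assumption.

```python
from statistics import mean, pvariance


def is_bar(width, height, n_pixels,
           min_side=5, min_pixels=100, min_compactness=0.8):
    """Threshold test on size, pixel count and compactness (hypothetical limits)."""
    if width < min_side or height < min_side:
        return False
    compactness = n_pixels / (width * height)  # pixels / bounding-frame area
    return n_pixels >= min_pixels and compactness >= min_compactness


def bar_layout(bar_sizes):
    """Return (layout, bar_width) from a list of (width, height) of bars.

    Assumption: the dimension with the smaller variance is the shared bar
    width. With no bars, the graph is treated as a vertical line chart.
    """
    if not bar_sizes:
        return "vertical", None
    widths = [w for w, _ in bar_sizes]
    heights = [h for _, h in bar_sizes]
    if pvariance(widths) <= pvariance(heights):
        return "vertical", mean(widths)
    return "horizontal", mean(heights)


print(is_bar(20, 180, 3500))   # solid rectangle
print(is_bar(400, 60, 1200))   # thin polyline spanning a wide frame
print(bar_layout([(20, 180), (20, 250), (21, 310), (20, 80)]))
```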

(4) According to the layout type (horizontal or vertical) of the data graph, identifying for each legend whether the connected component list corresponding to the legend is a bar or a polyline. Because some data graphs are a mix of the two forms, bars and polyline, it is desirable to determine whether the connected component list corresponding to a legend is a bar or a polyline. If there are bars, the average width of the bars has been obtained in step (3). This embodiment assumes that the widths of all bars are the same. Therefore, the number of bars in the connected component list corresponding to a legend that meet a threshold requirement can be determined according to the width of the bars and a threshold. When the number of bars is larger than 2, it is determined that the whole connected component list is composed of bars; otherwise, it is a connected component of a line.

(5) For all the connected components corresponding to the legends identified as bars, selecting candidate bars according to the width of the bars, and combining according to the positions and distances, to eliminate the effect of a division by a polyline. As shown in FIG. 11, the third bar has been divided into two connected components by the polyline; in this step, they are combined into a new bar. The number of pixels of the new bar connected component is the sum of the two previous ones; the top value of the bounding rectangle is the top value of the upper one; and the bottom value of the bounding rectangle is the bottom value of the lower one.
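A sketch of the recombination for vertical bars follows. The `Component` type and the tolerances `width_tol` and `max_gap` are hypothetical. The sketch assumes that two fragments belong to one bar when their left edges nearly coincide and the vertical gap between them is small.

```python
from typing import List, NamedTuple


class Component(NamedTuple):
    """Bounding rectangle and pixel count of a connected component."""
    left: int
    top: int
    right: int
    bottom: int
    n_pixels: int


def merge_split_bars(parts: List[Component], bar_width: float,
                     width_tol: int = 2, max_gap: int = 10) -> List[Component]:
    """Recombine vertical bar fragments separated by a crossing polyline."""
    candidates = [p for p in parts
                  if abs((p.right - p.left) - bar_width) <= width_tol]
    merged: List[Component] = []
    for part in sorted(candidates, key=lambda p: p.top):  # upper fragments first
        for i, prev in enumerate(merged):
            same_column = abs(prev.left - part.left) <= width_tol
            gap = part.top - prev.bottom
            if same_column and 0 <= gap <= max_gap:
                # Top comes from the upper fragment, bottom from the lower
                # one, and the pixel counts are summed.
                merged[i] = Component(min(prev.left, part.left), prev.top,
                                      max(prev.right, part.right), part.bottom,
                                      prev.n_pixels + part.n_pixels)
                break
        else:
            merged.append(part)
    return merged


fragments = [
    Component(100, 50, 120, 90, 800),     # upper part of a divided bar
    Component(100, 95, 120, 200, 2100),   # lower part of the same bar
    Component(150, 80, 170, 200, 2400),   # an undivided bar
]
print(merge_split_bars(fragments, bar_width=20))
```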

(6) For all the connected components corresponding to the legends identified as a line, obtaining a sequence of points on the line corresponding to the X axis, and removing extra points corresponding to the X-axis. Because there is noise and the lines are relatively thick, in the coordinate system there may be multiple foreground pixels corresponding to one x position. This step removes the extra points by scanning every x coordinate in the coordinate system and calculating an average y coordinate if there are multiple points. A point sequence of the form (x, mean (y)) represents a polyline.
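The thinning of a thick line into one point per x position can be sketched as below. The function name and the (x, y) pixel representation are hypothetical.

```python
from collections import defaultdict


def polyline_points(pixels):
    """Collapse foreground pixels (x, y) into one (x, mean(y)) point per x."""
    ys_by_x = defaultdict(list)
    for x, y in pixels:
        ys_by_x[x].append(y)
    return [(x, sum(ys) / len(ys)) for x, ys in sorted(ys_by_x.items())]


# A line that is three pixels thick at x=0, two at x=1 and one at x=2.
thick_line = [(0, 10), (0, 11), (0, 12), (1, 11), (1, 12), (2, 13)]
print(polyline_points(thick_line))
```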

5. Determining Key Points on the X-Axis and Locating a Corresponding X-Axis Label for Each Key Point

When the point sequence of the polyline has been determined, it is determined which of the points are key points (intersecting points in the line chart). One approach is to detect straight lines by Hough transform and calculate the intersections. However, some line charts use curvy lines, which cannot be detected using a straight line detection method. In the embodiment, the key points of a polyline are located by locating the key points of the X-axis. The locating of X-axis key points includes the following steps:

(1) In most graphs, the key points of the X-axis are in the middle position between the positions of the X-axis hatch marks; therefore, the embodiment sorts the obtained hatch mark position sequence on the X axis, and determines a middle point of adjacent hatch mark positions as a key point of the X axis;

(2) If there is no hatch mark on the X axis, the embodiment uses a middle point of the X-axis label text boxes as a key point of the X axis, and filters the obtained candidate key points by a noise data filtering method.

(3) If there is no X-axis label text box, the number of legends is less than 3 and there are bars, the embodiment uses a middle position in the X direction of a bar as a key point position of the X-axis.

In an embodiment, there are X-axis hatch mark values in the input data graph. The X-axis key points are determined according to the positions of the hatch marks, as shown in FIG. 12, where circles indicate the positions of the X-axis key points.
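The three cases above can be sketched as a single fallback chain. The function and parameter names are hypothetical. The sketch takes "a middle point of the X-axis label text boxes" to mean the horizontal center of each label box, which is an assumption, and it omits the noise data filtering of case (2) for brevity.

```python
def x_axis_key_points(hatch_xs, label_boxes=(), bar_boxes=(), n_legends=0):
    """Return X-axis key point positions.

    hatch_xs: x positions of hatch marks on the X axis.
    label_boxes, bar_boxes: (left, right) extents in the X direction.
    """
    if len(hatch_xs) >= 2:
        xs = sorted(hatch_xs)
        # Case (1): middle point of each pair of adjacent hatch marks.
        return [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    if label_boxes:
        # Case (2): horizontal center of each X-axis label box (assumption).
        return sorted((left + right) / 2 for left, right in label_boxes)
    if bar_boxes and n_legends < 3:
        # Case (3): middle position of each bar in the X direction.
        return sorted((left + right) / 2 for left, right in bar_boxes)
    return []


print(x_axis_key_points([500, 100, 300, 200, 400]))
print(x_axis_key_points([], label_boxes=[(130, 170), (230, 270)]))
```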

6. Locating Key Points of the Bars and Polyline According to the X-Axis Key Points, Determining Labeled Numerical Text Boxes that Correspond to the Key Points, and Identifying the Numerical Text

The specific process includes the following steps:

(1) determining key data points of the bars or polyline respectively, where the key data point of a vertical bar is the middle point of the top edge of the bar, and the key data point of a horizontal bar is the middle point of the far right of the bar, and the key data points of a polyline are the data points on the polyline corresponding to the key points of the X-axis;

(2) according to the position of a key data point, the layout of the data graph and the positions of the text boxes in the graph, searching for a corresponding labeled numerical text box for the key data point by means of distance and a threshold;

(3) identifying labeled numerical values in the labeled numerical text boxes by a digital recognition engine, because labeled values are generally numeric characters.
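Steps (1) and (2) can be sketched as follows. The function names and the `max_distance` threshold are hypothetical. For simplicity the sketch measures the Euclidean distance from the key data point to the center of each text box and does not take the graph layout into account, which is a simplification of step (2).

```python
import math


def bar_key_point(left, top, right, bottom, layout):
    """Key data point of a bar: top-edge middle (vertical) or right-edge middle."""
    if layout == "vertical":
        return ((left + right) / 2, top)
    return (right, (top + bottom) / 2)


def find_label_box(key_point, text_boxes, max_distance=40.0):
    """Return the text box whose center is nearest to key_point, or None.

    text_boxes: iterable of (left, top, right, bottom).
    """
    kx, ky = key_point
    best, best_dist = None, max_distance
    for box in text_boxes:
        left, top, right, bottom = box
        dist = math.hypot((left + right) / 2 - kx, (top + bottom) / 2 - ky)
        if dist <= best_dist:
            best, best_dist = box, dist
    return best


key = bar_key_point(100, 50, 120, 200, "vertical")
boxes = [(95, 30, 125, 44), (300, 30, 330, 44)]
print(key, find_label_box(key, boxes))
```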

According to an embodiment, the located bars, the key points of the polyline (indicated by small red circles) and the corresponding labeled text boxes (color of the rectangular frame corresponds to the legend color) are shown in FIG. 13. A recognition result of labeled numerical values of the bars (from left to right, with their actual values in parentheses) is: 18560 (18560), 25865 (25865), 32010 (32010), 18100 (8100). The last value is misidentified due to interference from the lines. A recognition result of labeled numerical text boxes corresponding to the key points of the polyline is: 720% (7.20%), 710% (7.10%), 86,90% (6.90%), 670% (6.70%). It can be seen that the performance of existing recognition methods on decimal points is not good.

7. Calculating a Corresponding Value for Each Pixel, and Estimating Corresponding Values of the Key Points of the Bars or Polyline

A corresponding value is estimated for each pixel according to the obtained Y-axis hatch mark value text boxes, the positions of the Y-axis hatch marks, the height of the bars and the corresponding text box values. When the corresponding value of each pixel has been obtained, estimated values of the key points are calculated according to the obtained coordinate axes and the positions of the key points. The specific process includes the following steps:

(1) A hatch mark generally corresponds to a hatch mark value text box. When the Y-axis and its corresponding hatch mark value text box list have been determined, the embodiment matches according to the positional relationships between the positions of the Y-axis hatch marks and the Y-axis labeled numerical text boxes, and identifies the values in the labeled numerical text boxes by a value recognition engine.

(2) calculating a candidate every-pixel-corresponding-value list according to the number of pixels between two hatch marks and the difference between the values in the corresponding labeled numerical text boxes, where every-pixel-corresponding-value = hatch mark corresponding value difference / hatch mark position height difference; the hatch mark corresponding value difference is the difference between the recognition results of the numerical text boxes corresponding to two hatch marks, and the hatch mark position height difference is the difference between the Y values of the positions of the two hatch marks.

(3) filtering out noise from the every-pixel-corresponding-value list by a noise data filtering method (see the foregoing description)

(4) calculating the average value of the every-pixel-corresponding-value list, as a final corresponding value of each pixel

(5) according to the obtained corresponding value of each pixel and the height of the key data point (in vertical layout, it is the distance between the key point and the X-axis; in horizontal layout, it is the distance between the key point and the left Y-axis), calculating an estimated value for each key data point. The estimated value = (height of key point − height of X-axis) * every-pixel-corresponding-value.
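Steps (2) to (5) can be sketched as below for a vertical layout. The noise filter follows the pairwise scheme recited in claim 2, but the relative tolerance `rel_tol`, the behavior when no pair is close enough, and all function names are assumptions. The tick positions and values are made-up illustration data, not taken from the embodiment.

```python
from itertools import combinations
from statistics import mean


def filter_noise(values, rel_tol=0.05):
    """Pairwise noise filter: the closest pair defines a standard value.

    Assumptions: tolerances are relative, and the list is returned
    unfiltered when even the closest pair disagrees.
    """
    if len(values) < 2:
        return list(values)
    a, b = min(combinations(values, 2), key=lambda p: abs(p[0] - p[1]))
    if abs(a - b) > rel_tol * max(abs(a), abs(b)):
        return list(values)
    standard = (a + b) / 2
    return [v for v in values if abs(v - standard) <= rel_tol * abs(standard)]


def value_per_pixel(ticks):
    """ticks: list of (y_pixel, recognized_value) for the Y-axis hatch marks."""
    ticks = sorted(ticks)
    candidates = [abs((v2 - v1) / (y2 - y1))
                  for (y1, v1), (y2, v2) in zip(ticks, ticks[1:])]
    return mean(filter_noise(candidates))


def estimate_value(key_y, axis_y, per_pixel):
    """Estimated value = pixel distance from the X-axis * value per pixel."""
    return abs(axis_y - key_y) * per_pixel


# Hatch marks every 100 pixels; the top label is misread as 90000.
ticks = [(400, 0), (300, 10000), (200, 20000), (100, 30000), (0, 90000)]
per_pixel = value_per_pixel(ticks)
print(per_pixel, estimate_value(key_y=215, axis_y=400, per_pixel=per_pixel))
```

The misread label yields one outlying candidate (600 per pixel), which the filter removes because it is far from the standard value set by the agreeing pair.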

In an embodiment, the bars corresponding to the left Y-axis have an every-pixel-corresponding-value: 103.09. The polyline corresponding to the right Y-axis has an every-pixel-corresponding-value: 0.00256641. The estimated values of the bars obtained according to the height of the bars (with actual values in parentheses) are: 18350.5 (18560), 25876.2 (25865), 31958.7 (32010), 8144.3 (8100). The estimated values corresponding to the key points based on the height of the key points of the polyline (with actual values in parentheses) are: 0.71867 (7.20%), 0.709444 (7.10%), 0.68970 (6.90%), 0.67072 (6.70%). It can be seen that the precision of the estimated values is lower than the recognition result of the numerical text boxes, but the accuracy is high, and there will not be a situation where the recognized value is significantly different from the actual value.

8. Determining a Final Result According to a Difference Between the Estimated Values and the Recognized Labeled Values, and Storing

Error estimation is performed on the estimated value (est_val) obtained at a key point of the bars or polyline and the recognition result (reco_val) of a labeled numerical text box determined based on the key point, i.e., error = 2*abs(est_val − reco_val)/(est_val + reco_val), and if the error is less than 0.1, it is determined that the recognition result is correct and the recognition result of the labeled numerical text box is used as the value of the key point; otherwise, the estimated value is used as the value of the key point. This step ensures the precision of the numerical identification while maintaining the accuracy of the numerical values, thereby eliminating the problem of significant errors brought by misidentification.
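The decision rule can be sketched as follows, using the first and last bar values of the embodiment. The function name is hypothetical. Taking absolute values in the denominator and falling back to the estimate when no label was recognized are assumptions added to keep the sketch well defined; for non-negative values the error equals the formula above.

```python
def final_value(est_val, reco_val, max_error=0.1):
    """Choose between the recognized label and the pixel-based estimate."""
    if reco_val is None:
        return est_val  # assumption: no label found, keep the estimate
    denom = abs(est_val) + abs(reco_val)
    if denom == 0:
        return reco_val  # both values are zero, so they agree
    error = 2 * abs(est_val - reco_val) / denom
    return reco_val if error < max_error else est_val


print(final_value(18350.5, 18560))  # error is about 0.011: label accepted
print(final_value(8144.3, 18100))   # error is about 0.76: estimate used
```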

In an embodiment, the recognition result of the text box of a legend is stored as a subject of a row, and the X-axis label recognition result is stored as a subject of a column, to an Excel table. Table 1 below shows an example.

TABLE 1 Recognition result of input graph

                                    2016     2017     2018     2019.01
Enforcement costs (million yuan)    18560    25865    32010    8144.330
Percentage in operating casts       0.719    0.709    0.690    0.671

To test the validity of the method for different types of data graph input cases, the embodiment in FIG. 14 gives two other input data graphs and their corresponding output results, which illustrate that the method of the present disclosure has high accuracy and precision, and is applicable to different types of data graphs, such as horizontally arranged bar charts, mixed data graphs of polyline and bars, and data graphs without coordinate axes.

The embodiments described above are only some preferred solutions of the present disclosure, which shall not be construed as limiting the scope of the present disclosure. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present disclosure, which shall all fall within the scope of the invention.

What is claimed is:
1. A method for extracting element data from a data graph with bars or polylines, comprising the steps of:
(S1) text area locating and text box classification in the data graph according to steps S11 to S15:
(S11) obtaining the data graph where data is to be extracted, locating all text boxes within the data graph by deep learning, and performing character recognition;
(S12) counting the number of text boxes at each position in an X direction of the data graph, to obtain an array of numbers of text boxes at different positions in the X direction; obtaining a local maximum of the numbers of text boxes in the array and a corresponding position; obtaining the difference between an average number of text boxes in a middle area in the X direction and the local maximum of the array, and determining there is a Y-axis hatch mark value text box at the corresponding position of the local maximum if the difference is within a threshold; and determining all text boxes at the corresponding position of the local maximum as the Y-axis hatch mark value text boxes according to the corresponding position, to obtain a Y-axis hatch mark value text box list;
(S13) performing a text box spacing consistency test on the Y-axis hatch mark value text box list using a noise data filtering method, with the spacing between adjacent text boxes being a filtering condition;
(S14) obtaining an X-axis hatch mark value text box list;
(S15) identifying graph title text in a graph title text box according to size characteristics of the graph title text box and positional distribution characteristics in the data graph;
(S2) locating of coordinate axes, and locating of the positions of hatch marks on the coordinate axes according to steps S21 to S22:
(S21) locating coordinate axes from the data graph, which comprises: first calculating a horizontal gradient and a vertical gradient of the data graph, respectively, and determining vertical edge pixels and horizontal edge pixels according to a horizontal gradient result and a vertical gradient result, respectively; then counting the number of consecutive edge pixels in each column and the number of consecutive edge pixels in each row, determining an edge pixel column whose number of consecutive edge pixels exceeds a set threshold as a candidate Y-axis, and determining an edge pixel row whose number of consecutive edge pixels exceeds a set threshold as a candidate X-axis; then merging adjacent candidate coordinate axes whose distance is less than a distance threshold; finally determining the coordinate axis and axis hatch mark value text box lists according to a positional relationship between candidate coordinate axes and candidate axis hatch mark value text box lists;
(S22) locating the positions of hatch marks on X-axis and Y-axis sequentially, where each of the coordinate axes is located by: first extracting a coordinate axis area image centered at a coordinate axis, where the width of the area image in a direction vertical to the coordinate axis covers the entire coordinate axis and hatch marks on the coordinate axis; then binarizing the coordinate axis area image, where the coordinate axis and the hatch marks on the coordinate axis are foreground; then counting foreground pixels in the binarized image in a direction vertical to the coordinate axis in a row-by-row or column-by-column manner; then obtaining a local maximum of an array obtained from the counting, as the position of a candidate hatch mark; finally filtering the obtained candidate hatch mark by the noise data filtering method, to obtain an actual hatch mark on the coordinate axis;
(S3) legend locating and information extraction according to steps S31 to S36:
(S31) performing connected component analysis by calculating color distances between adjacent pixels, to find all connected components with similar colors in the data graph; obtaining an average color value for each connected component, as the color of the connected component; and counting the number of pixels in the connected component and bounding rectangles;
(S32) filtering all the connected components according to height, width, number of pixels, aspect ratio and compactness of the connected components using a threshold method, to obtain a candidate legend meeting a legend requirement;
(S33) scanning all possible candidate legend connected components in pairs, so that two connected components meeting color and height consistency requirements are combined into a new candidate legend;
(S34) performing S31 to S33 on each of the three areas of the data graph: above, to the right of, and below a data area, to obtain all candidate legends in these three areas; selecting candidate legends in an area with the largest number of candidate legends as actual legends of the data graph according to the number of candidate legends in each of the three areas;
(S35) performing layout analysis on the obtained actual legends according to spatial positions of the actual legends, to determine whether the actual legends in the data graph are arranged in a vertical, horizontal or hybrid layout; and filtering out one or more of the actual legends nonconforming with the layout;
(S36) according to the layout of the actual legends, searching for a corresponding legend text box for each actual legend from the data graph, and identifying text characters and character color from each legend text box;
(S4) extracting corresponding bar or polyline connected components according to legend color, and filtering and classification according to steps S41 to S45:
(S41) combining background color, character color in the text and legend color into a color list of a variety of color classes; scanning the pixels in the data area of the data graph, calculating color distances between the color of a pixel and the variety of colors in the color list, and determining a color class having the smallest color distance as the class of the pixel;
(S42) performing connected component analysis on pixels of each class, filtering the connected components by a threshold method, to obtain a corresponding set of connected components for each actual legend in the data area;
(S43) based on the height, width, number of pixels and compactness of the connected components, scanning all connected component sets according to a threshold, to determine for each connected component whether the connected component is a bar, and if it is a bar, calculating the variance of the heights of all bars and the variance of the widths of all the bars in the data graph, determining whether the bars in the bar chart are horizontal or vertical according to the variances, and calculating the width of the bars; if there is no bar, determining that the data graph is a line chart, with a vertical layout;
(S44) according to the layout direction type of the data graph, identifying for each actual legend whether a connected component set corresponding to the actual legend is a bar or a polyline, and determining a classification axis and a numerical axis in the data graph;
(S45) for all the connected components corresponding to actual legends identified as bars, selecting a bar whose width meets the bar width described in S43 as a candidate bar for the actual legend, and analyzing the spatial positions and distances of all the bars to identify whether there is a bar divided by a polyline into two connected components, and if there is, recombining them into one;
(S5) determining key points on the classification axis and locating a corresponding classification-axis label for each classification-axis key point according to the layout direction type of the data graph;
(S6) locating key data points of the bars or polyline according to the classification-axis key points, determining a corresponding labeled numerical text box for each key data point, and identifying the numerical text;
(S7) calculating a corresponding value for each pixel according to the numerical axis, and estimating corresponding values for the key points of the bars or polyline; and
(S8) for each key data point in the data graph, performing error verification on the identified numerical value by the estimated value, to determine a final result.
2. The method of claim 1, wherein the noise data filtering method comprises: comparing all the data to be filtered in pairs, to find a data pair with the smallest value difference corresponding to the filter condition; if the value difference meets an error requirement, calculating the average of the data pair and determining the average of the data pair as a standard value; and then calculating a difference between each of the rest of the data to be filtered and the standard value, and filtering out data with a difference exceeding a threshold.
3. The method of claim 1, wherein in step S12, if there are Y-axis hatch mark value text boxes on both sides of the data graph, it is determined that there are two Y-axes, left and right; and a left Y-axis hatch mark value text box list and a right Y-axis hatch mark value text box list are obtained.
4. The method of claim 1, wherein in step S33, the new candidate legend has a bounding rectangle composed of the bounding rectangles of the two connected components, the number of pixels is the sum of the pixels of the two connected components, and the color is the average of the colors of the two connected components.
5. The method of claim 1, wherein in step S44, the determining a classification axis and a numerical axis in the data graph comprises: when the data graph is in a vertical layout, determining the X-axis as the classification axis and the Y-axis as the numerical axis; when the data graph is in a horizontal layout, determining the Y-axis as the classification axis and the X-axis as the numerical axis.
6. The method of claim 1, wherein step S5 comprises: (S51) if there are hatch marks on the classification axis, sorting them based on their positions according to the obtained hatch marks on the classification axis; and determining a middle point of two adjacent hatch marks as a classification axis key point; and (S52) if there is no hatch mark on the classification axis, determining a middle point between classification axis hatch mark value text boxes as a classification axis key point; and filtering the obtained classification axis key points by the noise data filtering method.
7. The method of claim 1, wherein step S6 comprises: (S61) determining key data points of the bars or polyline respectively, where the key data point of a vertical bar is a middle point of the top edge of the bar, and the key data point of a horizontal bar is a middle point of the far right of the bar, and the key data points of a polyline are the data points on the polyline vertically corresponding to the key points of the classification axis; (S62) according to the position of each key data point, the layout of the data graph and the positions of the text boxes in the data graph, searching for a corresponding labeled numerical text box for each key data point; and (S63) identifying a labeled numerical value in each labeled numerical text box.
8. The method of claim 1, wherein step S7 comprises: (S71) matching according to the positional relationships between the hatch marks on the numerical axis and the labeled numerical text boxes on the numerical axis, and identifying the values in the labeled numerical text boxes on the numerical axis; (S72) for any two adjacent hatch marks on the numerical axis, calculating a corresponding value for each pixel according to the difference between the number of pixels between two hatch marks and the difference between the values in the corresponding labeled numerical text boxes, where the calculated values form a single pixel corresponding value list; (S73) filtering out noise from the single pixel corresponding value list by the noise data filtering method; (S74) calculating an average value of the single pixel corresponding value list after the noise filtering, as a final corresponding value M of the single pixel; and (S75) according to the obtained corresponding value M of the single pixel and the bar height H of the key data points, calculating an estimated value for each key data point, where the bar height H of a data graph in a vertical layout is the distance from the key data point to the X-axis, and the bar height H of a data graph in a horizontal layout is the distance from the key data point to the left Y axis.
9. The method of claim 8, wherein step S6 comprises: (S61) determining key data points of the bars or polyline respectively, where the key data point of a vertical bar is a middle point of the top edge of the bar, and the key data point of a horizontal bar is a middle point of the far right of the bar, and the key data points of a polyline are the data points on the polyline vertically corresponding to the key points of the classification axis; (S62) according to the position of each key data point, the layout of the data graph and the positions of the text boxes in the data graph, searching for a corresponding labeled numerical text box for each key data point; and (S63) identifying a labeled numerical value in each labeled numerical text box; and wherein step S8 comprises: for each key data point in the data graph, comparing the labeled value obtained by step S63 with the estimated value obtained by step S75, and if within an error range, determining the recognition result correct and determining the labeled value as the value of the key point; otherwise, determining the estimated value as the value of the key point.